ABSTRACT

The  research  in  this  paper  focuses  on  the  I/O 
problem  associated  with  a  parallel  application 
writing  to  a  single  physical  disk.  Included  in  our 
research  are  the  original  ideas  that  led  to  the  first 
version  of  the  parallel  software,  subsequent 
versions  of  the  software  derived  from  lessons 
learned  from  benchmark  results,  and  speedup 
results  of  each  version.  The  underlying  purpose  of 
this  software  is  to  process  hydrographic  data 
having  a  complicated,  multi-tiered  format.  The  data 
processing  involves  reading  tens  to  hundreds  of 
files  containing  raw  data,  filtering  out  extraneous 
data  values,  and  writing  the  filtered  data  to  a  single 
file  used  in  additional  processing.  The  problem  is 
not  computationally  intensive,  but  bound  by  the 
system’s  file  writing  capability.  Results  show  that 
the  more  responsible  the  software  was  for 
organizing  the  data  before  writing,  the  better  the 
speedup.  The  critical  factor  for  writing  data 
efficiently  involved  the  limitation  of  writing  data 
over a single I/O controller. Our parallel software
is most useful where system specifications do
not allow for the use of parallel file systems or
writing data over multiple I/O controllers.
1. INTRODUCTION


Space  Center  has  been  tasked  to  develop  ways  to 
speed up hydrographic data processing at the Naval
Oceanographic  Office  (NAVOCEANO)  [1].  This 
paper  presents  the  final  development  of  a 
parallelized  version  of  the  Pfm  loader  application 
customized  to  run  on  a  Beowulf  cluster. 

This  paper  specifically  covers  software 
development  efforts  in  early  FY02  [2].  In  a  series 
of  algorithms,  called  Schemes  D,  E,  and  F,  a 
parallel  algorithm  previously  implemented  in 
Scheme  C  (see  Fig.  1)  has  been  integrated  with  an 
improved  version  of  NAVOCEANO’s  Pure  File 
Magic (PFM) library. Earlier versions of this
software not covered here, namely Schemes A and B,
served as a design platform for the software
architecture found in Scheme C. The main goal of
this  work  was  to  increase  the  writing  rate  of  binned 
data  to  a  physical  disk.  The  final  software  version 
is  Scheme  F. 

The  final  parallel  code  of  Scheme  F  achieves  the 
best  speedup  for  the  largest  available  test  dataset. 
The  simple  runs  (no  filtering)  exhibited  a  speedup 
of  10  as  compared  to  the  original,  serial  algorithm. 
Runs  with  swath  filtering  showed  a  top  speedup  of 
8.  Runs  with  area  filtering  reached  a  speedup  of 
6.5.  Runs  with  both  swath  and  area  filtering 
showed a top speedup of 7. In all cases, the data
strongly suggest that greater speedups could be
achieved for input datasets larger than those tested.
Section 2 of this paper is devoted to Scheme D.
Section  3  presents  results  on  an  implemented 
threaded  version  of  Scheme  D.  Section  4  deals  with 
results  for  Scheme  E.  Section  5  describes  Scheme 
F.  Detailed  results  are  presented  on  the 
performance of Scheme F when swath-based and
area-based filtering are enabled. Section 6 presents
results  illustrating  the  effect  of  the  physical  disk’s 
I/O  throughput  rates  on  speedup  results.  Section  7 
discusses  results  and  future  plans. 


Fig. 1. Scheme C (the precursor to the schemes presented in this paper)


2. SCHEME D

In Scheme D, as in Scheme C, nodes are assigned
one of the following roles: a master node, an I/O
node, or a slave node. In general, more of the
functionality of the original PFM library has been
assigned to the slave nodes in Scheme D than in
earlier schemes.

• Each slave node sorts read-in sounding data
according to bin index and compresses them
in the same way as is done in the original
PFM library. The sending buffer is now in
the form of a character string.

• The I/O node receives sounding data in the
PFM compressed form and copies them into
the PFM-style depth blocks consisting of six
sounding data and a continuation pointer.

• The master node role has not been changed
from Scheme C.

Results

Fig. 2 illustrates achieved speedups for different
loads. The test datasets are denoted by the number
of  files  tested  (7,  12,  24,  48  and  74  files).  Timings 
have  been  averaged  over  four  runs.  Speedup 
figures  are  created  by  comparing  parallel  and  serial 
runs.  In  the  case  of  algorithm  speedups,  the 
comparisons  are  made  with  corresponding  timings 
for  three  Message  Passing  Interface  (MPI) 
processes.  Speedups  exhibited  by  Scheme  D  reach 
about  7  for  the  four  larger  dataset  loads.  For  the 
smallest  dataset,  Scheme  D  shows  a  speedup  of 
around  5,  reducing  the  run  time  from  around  2  min 
to  23  sec.  For  the  largest  input  dataset  tested,  the 
measured speedup is sevenfold, reducing execution
time from around 24 min to 3 min 20 sec. The
optimal  number  of  MPI  processes  is  9  for  this 
scheme. 
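
The speedup figures quoted above follow from dividing the serial run time by the parallel run time. The short sketch below reproduces that arithmetic from the rounded timings given in the text ("around 2 min", "around 24 min"), so its results are only as precise as those rounded inputs. The function names are illustrative and are not taken from the original software.

```python
def to_seconds(minutes, seconds=0):
    """Convert a 'min sec' timing, as quoted in the text, to seconds."""
    return 60 * minutes + seconds


def speedup(serial_s, parallel_s):
    """Speedup is the serial run time divided by the parallel run time."""
    return serial_s / parallel_s


# Rounded timings quoted in the text for Scheme D.
smallest = speedup(to_seconds(2), to_seconds(0, 23))    # 7 input files
largest = speedup(to_seconds(24), to_seconds(3, 20))    # 74 input files

print(f"smallest dataset: speedup of about {smallest:.1f}")
print(f"largest dataset:  speedup of about {largest:.1f}")
```

Both values agree with the reported speedups of around 5 and around 7.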


Fig.  2.  Achieved  speedup  of  Scheme  D  for 
different  loads. 


3.  THREADING 

Initial  Plan 

The  overall  processing  flow  can  be  improved  by 
grouping  tasks  into  at  least  two  threads  on  the  I/O 
node  and  a  slave  node.  Tasks  related  to  MPI 
communication  can  be  accomplished  by  a  separate 
thread.  A  testing  program  mixing  MPI  and 
threading confirmed that MPI Chameleon (MPICH)
allows  only  a  single  thread  to  execute  MPI  calls. 
On  the  I/O  node,  a  separate  communication  thread 
would  take  care  of  receiving  buffers.  The  other 
thread  would  be  responsible  for  writing  data  to  the 
PFM  output  file.  On  the  slave  node,  a 
communication  thread  would  send  full  buffers  to 
the  I/O  node.  The  second  thread  would  transfer  and 
process  sounding  data  from  input  files  to  buffers. 
Tests  are  planned  to  check  if  the  currently  used 
dual-CPU  system  boards  will  efficiently  support 
two  application  threads.  A  quad  system  board  for 
the  I/O  node  could  be  used  if  that  proved 
beneficial. 

Threading  the  I/O  Node:  Algorithm 

Only  one  part  of  the  above  plan  has  been 
implemented:  Scheme  D  has  been  tested  using 
Linux POSIX threads on the I/O node only.
Processing on the I/O node has been separated into
two threads. The second thread submits data to an
output PFM file. The main thread does all
necessary  initialization  and  then  creates  one 
additional  thread,  called  the  I/O  processing  thread. 
The  main  thread  acts  as  the  I/O  communication 
thread.  All  MPI  function  calls  are  done  only  by  the 
I/O  communication  thread.  The  I/O  communication 
thread  receives  buffers  from  worker  nodes  and 
submits  them  to  a  work  queue.  The  I/O  processing 
thread  checks  the  work  queue  for  any  full  buffers, 
and  uses  PFM  function  calls  to  write  data  to  the 
PFM  output  file.  An  empty  buffer  is  returned  to  the 
work  queue  area.  The  algorithm  implements  reuse 
of  buffers  in  order  to  avoid  reinitializing  buffers. 
All access to the work queue is safeguarded by a
mutex lock (mutually exclusive access lock)
mechanism, a standard tool available in the thread
library. A set of a few empty buffers is initialized
in advance on the I/O communication side. If all
empty buffers on the I/O communication side are
used, that thread enters the work queue area and
takes any empty buffers waiting there. If no empty
buffers are available in the work queue area, the I/O
communication thread initializes a fresh buffer.
Only if initializing the fresh buffer fails does the I/O
communication thread wait for the I/O processing
thread to return a buffer. The creation of additional
buffers  is  always  expected  because  PFM  library 
operations  are  the  slowest  part  of  processing  (since 
these  library  operations  require  multiple  accesses  to 
the  physical  disk  drive). 
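
The two-thread arrangement described above can be sketched as follows. This is a minimal illustration in Python rather than the original C code with POSIX threads, and it makes several assumptions: the incoming MPI messages are simulated by a list of byte strings, the PFM write is replaced by appending to a list, and all names (WorkQueue, BUF_SIZE, pool_size) are hypothetical. The fallback of waiting when a fresh buffer cannot be allocated is omitted.

```python
import threading
import time
from collections import deque

BUF_SIZE = 16  # hypothetical initial buffer size in bytes


class WorkQueue:
    """Mutex-protected area holding full buffers to write and empty ones to reuse."""

    def __init__(self):
        self._lock = threading.Lock()
        self._full = deque()
        self._empty = deque()
        self._closed = False

    def put_full(self, buf):
        with self._lock:
            self._full.append(buf)

    def get_full(self):
        """Return (buffer or None, True if no more buffers will ever arrive)."""
        with self._lock:
            if self._full:
                return self._full.popleft(), False
            return None, self._closed

    def put_empty(self, buf):
        with self._lock:
            self._empty.append(buf)

    def get_empty(self):
        with self._lock:
            return self._empty.popleft() if self._empty else None

    def close(self):
        with self._lock:
            self._closed = True


def processing_thread(queue, written):
    """I/O processing thread: drain full buffers and hand them back for reuse."""
    while True:
        buf, finished = queue.get_full()
        if buf is None:
            if finished:
                return
            time.sleep(0.001)  # nothing to write yet; poll again shortly
            continue
        written.append(bytes(buf))  # stand-in for the PFM library write
        queue.put_empty(buf)


def communication_thread(queue, messages, pool_size=2):
    """I/O communication thread: the only place that would call MPI."""
    pool = [bytearray(BUF_SIZE) for _ in range(pool_size)]
    allocated = pool_size
    for msg in messages:  # stand-in for receiving a buffer over MPI
        buf = pool.pop() if pool else queue.get_empty()
        if buf is None:  # no reusable buffer anywhere: create a fresh one
            buf = bytearray(BUF_SIZE)
            allocated += 1
        buf[:] = msg  # copy the payload into the reused buffer
        queue.put_full(buf)
    queue.close()
    return allocated


def run(messages):
    queue, written = WorkQueue(), []
    worker = threading.Thread(target=processing_thread, args=(queue, written))
    worker.start()
    allocated = communication_thread(queue, messages)  # main thread
    worker.join()
    return written, allocated


written, allocated = run([f"buffer-{i}".encode() for i in range(20)])
print(len(written), "buffers written using", allocated, "allocated buffers")
```

Because the sketch never caps the number of fresh buffers, a writer that is slower than the receiver makes the allocated count grow with the input.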

Threading  the  I/O  Node:  Results 

Fig.  3  illustrates  achieved  speedups  for  different 
loads.  Speedup  values  for  the  Scheme  D-Threaded 
version  are  noticeably  smaller  than  the  original 
Scheme  D.  Such  results  could  be  attributed  to  the 
additional  overhead  of  the  thread  library.  However, 
results  for  the  largest  test  dataset  (74  files)  are 
especially disappointing, due to the lack of control
over memory usage on the I/O node in the
threaded code. This preliminary threaded version
places no limit on the number of buffers used to
hold incoming data on the I/O node. Since
worker nodes deliver buffers much faster than the
I/O node can write them to a (slow)
physical disk, incoming buffers forced the
operating system on the I/O node to use disk swap
space, causing a significant slowdown in
processing. 
Fig. 3. Achieved speedup of Scheme D-Threaded
for different loads


4.  SCHEME  E 
Fig. 4. Achieved speedup of Scheme E for
different  loads 


Grouping  Sounding  Data  into  Packages  of  6 

The  improved  version  of  the  PFM  library,  as  well 
as  the  original  library,  writes  data  to  the  physical 
disk  in  fixed  length  blocks.  Each  block  can  hold  up 
to  six  sounding  values  (the  value  of  six  is 
configurable).  To  take  advantage  of  these  blocks, 
sounding  data  are  sent  in  groups  of  six  (with  the 
same  bin  index).  At  first,  slave  nodes  send  depths 
grouped into packets of six (with the same bin index)
to  create  full  depth  records  in  PFM  style.  When  all 
input  file  names  have  been  distributed  (and  most  of 
the  files  have  been  already  processed)  the  finishing 
slave  nodes  have  some  leftover  depth  data. 
Separately,  each  node  cannot  create  a  final  full 
depth  record  containing  6  sounding  values  for  one 
bin  index.  Under  the  direction  of  the  master  node, 
the  first  slave  node  to  finish  processing  becomes  a 
sorting  and  grouping  node.  Then,  the  other  slave 
nodes  send  their  leftover  data  to  that  sorting  node 
for  consolidation.  Subsequently,  full  records  are 
sent  to  the  sorting  and  grouping  node  (the  one 
which  currently  is  accepting  data)  with  the  final 
leftover  data  sent  to  the  I/O  node.  This  additional 
effort  of  sorting  into  groups  of  six  depths  has  the 
advantage  of  increasing  speed  of  writing  the  data  to 
disk  as  well  as  avoiding  any  need  to  reread  and 
rewrite  partially  empty  depth  records.  Further 
improvements  include  a  new  interface  function  to 
the  PFM  library  to  accommodate  direct  writing  of 
blocks  to  the  PFM  file.  Also,  slave  nodes  are  given 
the additional task of creating complete depth
blocks.  This  improvement  reduces  the  I/O  node’s 
task  to  simply  updating  the  continuation  pointers 
before  writing  data  to  disk. 
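
The grouping step can be illustrated with a small sketch. It assumes soundings are represented as (bin index, depth) pairs and ignores the PFM compression and continuation pointers. The function names and sample data are hypothetical and do not come from the PFM library. Each slave node emits only full blocks of six, and a sorting node later merges the leftovers from several nodes into further full blocks.

```python
from collections import defaultdict

BLOCK_SIZE = 6  # soundings per depth block; the text notes this is configurable


def group_soundings(soundings, block_size=BLOCK_SIZE):
    """Split (bin_index, depth) pairs into full blocks and per-bin leftovers."""
    by_bin = defaultdict(list)
    for bin_index, depth in soundings:
        by_bin[bin_index].append(depth)

    full_blocks, leftovers = [], {}
    for bin_index in sorted(by_bin):
        depths = by_bin[bin_index]
        n_full = len(depths) // block_size * block_size
        for start in range(0, n_full, block_size):
            full_blocks.append((bin_index, depths[start:start + block_size]))
        if depths[n_full:]:
            leftovers[bin_index] = depths[n_full:]
    return full_blocks, leftovers


def consolidate(leftover_dicts, block_size=BLOCK_SIZE):
    """Sorting node: merge leftovers from several slave nodes and regroup them."""
    merged = [(b, d) for left in leftover_dicts
              for b, depths in left.items() for d in depths]
    return group_soundings(merged, block_size)


# Two hypothetical slave nodes that both hold soundings for bin 3.
node_a = [(3, float(d)) for d in range(8)]        # 8 soundings in bin 3
node_b = [(3, float(d)) for d in range(10, 14)]   # 4 soundings in bin 3

blocks_a, left_a = group_soundings(node_a)
blocks_b, left_b = group_soundings(node_b)
extra_blocks, final_left = consolidate([left_a, left_b])

print("full blocks from node A:", len(blocks_a))
print("full blocks after consolidation:", len(extra_blocks))
print("final leftovers:", final_left)
```

In this example neither node can complete a second block alone, but the combined leftovers fill one exactly, so no partially empty record has to be written.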


Results 

Fig.  4  illustrates  achieved  speedups  for  different 
loads,  compared  with  the  original  serial 
application.  Timings  have  been  averaged  over  four 
runs.  A  comparison  between  speedup  numbers  for 
Schemes  D  and  E  shows  that  values  for  Scheme  E 
are  noticeably  smaller,  except  for  the  largest  test 
dataset.  For  the  smallest  dataset,  Scheme  E  shows  a 
speedup  around  4.5,  reducing  the  run  time  from 
around  2  min  to  26  sec.  For  the  largest  input 
dataset  tested,  the  measured  speedup  is  around  7, 
reducing  execution  time  from  around  24  min  to  3 
min  19  sec.  The  optimal  number  of  MPI  processes 
is  10  for  larger  datasets  in  Scheme  E. 

Filtering,  Recomputing  and  Final  Tune-up 

The  original  filtering  functionality  works  well  in 
the  Beowulf  cluster  environment.  The  original 
swath  filtering  was  being  handled  separately  for 
each  input  file.  Thus  the  new  distributed  scheme  of 
processing  input  files  by  a  group  of  slave  nodes  has 
no  effect  on  swath  filtering.  The  same  has  been 
found for the recomputing step and area-based
filtering,  which  run  without  any  changes  to  the 
code  because  they  are  executed  in  the  serial  phase, 
on  the  I/O  node,  only  after  the  PFM  file  has  already 
been  created. 

5.  SCHEME  F 

A  tuned  version  of  the  new  PFM  library  was  used 
in  Scheme  F.  Scheme  F  was  tested  with  four 
different  setups,  involving  two  possible  filtering 
procedures:  swath  filtering,  area  filtering,  both 
swath and area filtering, and no filtering (called
“simple  runs”).  Each  of  these  four  setups  has  been 
tested  with  the  standard  data  test  loads.  Results 
with  no  filtering  are  illustrated  in  Fig.  5. 


Fig.  5.  Achieved  speedups  of  Scheme  F  with  no 
Filtering 

Scheme  F  achieved  the  best  speedup  for  the  largest 
available  test  dataset.  The  simple  runs  (no  filtering) 
exhibited  a  speedup  of  10.  Runs  with  swath 
filtering  enabled  showed  a  top  speedup  of  8.  Runs 
with  area  filtering  reached  a  speedup  of  6.5.  Runs 
with  both  swath  and  area  filtering  enabled  showed 
a  top  speedup  of  7.  In  the  simple  runs,  swath 
filtering  and  area-based  filtering  are  turned  off. 

Fig. 6. Scheme F with Area Filtering

Fig. 6 illustrates achieved speedups for different
loads and filter processing. For the smallest dataset,
when swath filtering is turned on, worker nodes
filter the sounding data by swath before assembling
them and sending them to the I/O node. Since swath
filtering puts additional processing onto worker
nodes, such code behavior is to be expected.
Worker  nodes  perform  swath  filtering  on  sounding 
data  read  from  the  input  files.  When  area  filtering 
is  also  turned  on,  the  I/O  node  also  performs  area 
filtering  of  sounding  data.  Since  area  filtering  is 
done  exclusively  by  the  I/O  node  after  all  sounding 
data  has  been  written  to  the  PFM  file,  the  area 
filtering  is  performed  in  the  serial  phase  of  overall 
processing. 
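
Why a serial phase limits the achievable speedup can be illustrated with Amdahl's law. This is a standard textbook model introduced here for illustration only and is not part of the original analysis. The serial fractions and the worker count below are hypothetical values, not measurements from the cluster.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: speedup when only the non-serial part is spread over workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


WORKERS = 8  # hypothetical number of nodes doing parallel work

# Hypothetical serial fractions: a longer serial phase (for example, one that
# includes area filtering on the I/O node) lowers the attainable speedup.
for serial_fraction in (0.02, 0.05, 0.10):
    s = amdahl_speedup(serial_fraction, WORKERS)
    print(f"serial fraction {serial_fraction:.2f}: speedup {s:.2f}")
```

In this model the speedup can never exceed the reciprocal of the serial fraction, however many nodes are added.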

6.  THE  ROLE  OF  HARD  DRIVE 
PERFORMANCE 

This  section  presents  arguments  to  support  the 
claim  that  the  application  Pfmloader  is  I/O  bound. 
When  Scheme  D  was  developed,  upgrading  the 
BIOS  on  each  Beowulf  cluster  node  resulted  in 
significant  improvement  in  the  throughput  of 
cluster  IDE  (ATA  100)  disks.  These  changes 
significantly  affected  the  run  time  of  parallel  codes 
as  well  as  the  original  serial  code.  The  changes  for 
the original serial code range from 5% to 11% (see
Table 1). The resulting speedup is presented in Fig.
7.  In  the  case  of  algorithm  speedups,  the 
comparisons  are  done  with  corresponding  timings 
for  three  MPI  processes.  The  changes  range  from 
29%  to  57%,  which  at  least  triples  that  of 
corresponding  percentage  changes  for  serial  runs. 
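
The percentage changes for the serial code can be rechecked directly from the timings listed in Table 1. The sketch below assumes the change is computed relative to the timing before the BIOS upgrade. Because the table gives timings rounded to one decimal place, the recomputed values may differ from the reported ones in the second decimal.

```python
files = [7, 12, 24, 48, 74]
before = [125.1, 203.5, 439.3, 946.0, 1607.1]   # serial code, before BIOS upgrade
after = [118.4, 192.2, 412.3, 893.7, 1433.2]    # serial code, with BIOS upgrade
reported = [5.33, 5.58, 6.13, 5.53, 10.82]      # percentage change from Table 1


def percent_change(old, new):
    """Reduction in run time, as a percentage of the old run time."""
    return 100.0 * (old - new) / old


computed = [percent_change(b, a) for b, a in zip(before, after)]
for n, c, r in zip(files, computed, reported):
    print(f"{n:2d} files: computed {c:.2f}%, reported {r:.2f}%")
```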


7.  SUMMARY 


The  optimal  number  of  MPI  processes  for  the 
simple  runs  with  large  datasets  is  9.6.  When  area 
filtering  is  turned  on,  the  I/O  node  filters  sounding 
data by geographic area. Since area filtering is
done  exclusively  by  the  I/O  node  after  all  sounding 
data  has  been  already  written  to  the  PFM  file,  this 
processing  adds  to  the  length  of  the  serial  phase  in 
the  overall  processing.  The  serial  processing  nature 
of  area  filtering  causes  the  greatest  reduction  in 
performance.  The  results  generated  by  testing  two 
of  the  four  types  of  processing  provide  arguments 
for  increasing  the  size  of  the  Beowulf  cluster. 
These  two  processing  types  are  (a)  processing  with 
swath  filtering  enabled,  and  (b)  processing  with 
both  swath  and  area  filtering  enabled.  For  the 
largest  test  dataset,  both  processing  methods 
achieved  their  best  speedups  when  running  with  the 
maximal  possible  number  of  MPI  processes. 
However,  the  same  data  indicates  the  speedup 
increase  would  be  modest. 

In  all  cases,  the  data  strongly  suggests  that  a 
speedup  greater  than  10  could  be  achieved  for 
larger  datasets.  This  property  is  a  positive 
characteristic  of  the  current  parallel  code.  It  also 
strongly suggests that the five test datasets have not
yet pushed the current code and cluster
configuration to their limits.
TABLE 1

Timings of change in hard drive performance

Number of Generic Sensor Format (GSF) input files  |  7  |  12  |  24  |  48  |  74
Timings of serial code, before BIOS upgrade  |  125.1  |  203.5  |  439.3  |  946.0  |  1607.1
Timings of serial code, with BIOS upgrade  |  118.4  |  192.2  |  412.3  |  893.7  |  1433.2
Percentage change  |  5.33%  |  5.58%  |  6.13%  |  5.53%  |  10.82%